Accelerating C/C++ code with CUDA on GPUs

In this self-paced, hands-on lab, we will use CUDA C/C++ to accelerate code on NVIDIA GPUs.

Lab created by Mark Ebersole (Follow @CUDAHamster on Twitter)

The following timer counts down to a five-minute warning before the lab instance shuts down. You should get a pop-up at the five-minute warning reminding you to save your work!


Before we begin, let's verify WebSockets are working on your system. To do this, execute the cell block below by giving it focus (clicking on it with your mouse), and hitting Ctrl-Enter, or pressing the play button in the toolbar above. If all goes well, you should see some output returned below the grey cell. If not, please consult the Self-paced Lab Troubleshooting FAQ to debug the issue.


In [1]:
print "The answer should be three: " + str(1+2)


The answer should be three: 3

Let's execute the cell below to display information about the GPUs running on the server.


In [2]:
!nvidia-smi


Tue Feb  7 22:18:37 2017       
+------------------------------------------------------+                       
| NVIDIA-SMI 340.29     Driver Version: 340.29         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID K520           On   | 0000:00:03.0     Off |                  N/A |
| N/A   33C    P8    17W / 125W |     10MiB /  4095MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|  No running compute processes found                                         |
+-----------------------------------------------------------------------------+

If you have never before taken an IPython Notebook based self-paced lab from NVIDIA, click this green box.

The following video will explain the infrastructure we are using for this self-paced lab, as well as give some tips on its usage. If you've never taken a lab on this system before, it's highly recommended that you watch this short video first.

Introduction to CUDA C/C++

In this lab, we will learn how to write GPU code using C or C++, two of the ever-growing list of CUDA-enabled languages. By learning just a few new syntactic elements, as well as a few API calls, we'll be able to unlock the massively parallel capability of an NVIDIA GPU.

Watch the following short video introduction to Grids, Blocks, and Threads:

This lab consists of three tasks that will require you to modify some code, compile and execute it. For each task, a solution is provided so you can check your work or take a peek if you get lost.

If you are still confused now, or at any point in this lab, you can consult the FAQ located at the bottom of this page.

Hello Parallelism

For the first task, we are going to be using the following concepts:

  • __global__ - this keyword tells the CUDA compiler that the function is to be compiled for the GPU and is callable from the host (and, on GPUs that support dynamic parallelism, from the GPU itself). For CUDA C/C++, the nvcc compiler will handle compiling this code.
  • blockIdx.x - this is a read-only variable that is defined for you. It is used within a GPU kernel to determine the ID of the block which is currently executing code. Since there will be many blocks running in parallel, we need this ID to help determine which chunk of data that particular block will work on.
  • threadIdx.x - this is a read-only variable that is defined for you. It is used within a GPU kernel to determine the ID of the thread which is currently executing code in the active block.
  • blockDim.x - this is a read-only variable that is defined for you. It simply returns a value indicating the number of threads there are per block. Remember that all the blocks scheduled to execute on the GPU are identical, except for the blockIdx.x value.
  • myKernel <<< number_of_blocks, threads_per_block>>> (...) - this is the syntax used to launch a kernel on the GPU. Inside the triple-angle brackets we set two values. The first is the total number of blocks we want to run on the GPU, and the second is the number of threads there are per block. It's possible, and in fact recommended, for one to schedule more blocks than the GPU can actively run in parallel. In this case, the system will just continue executing blocks until they have all run.
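
Here is a minimal, self-contained example showing how these pieces fit together. It is illustrative only; the code in task1/task1.cu may differ, for example in its launch configuration:

#include <stdio.h>

__global__ void hello()
{
  // Each thread prints its own ID and the ID of the block it belongs to.
  printf("Hello from Thread %d in block %d\n", threadIdx.x, blockIdx.x);
}

int main()
{
  // Launch 2 blocks of 4 threads each; the GPU decides the execution order.
  hello<<< 2, 4 >>>();

  // Wait for the GPU to finish so all the output is flushed before the program exits.
  cudaDeviceSynchronize();

  return 0;
}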

Let's explore the above concepts by doing a simple "Hello Parallelism" example.

There is nothing you need to do to the code to get this example to work. Before touching the code at all, select the next cell down and hit Ctrl-Enter (or the play button in the toolbar) to compile using the nvcc compiler from NVIDIA and run it. The nvcc compiler does the following basic steps:

  1. From the .cu source file, separate the code which should be compiled for the GPU from the code which should be compiled for the CPU
  2. nvcc will compile the GPU code itself
  3. nvcc will give the host compiler, in our case gcc, the CPU code to compile
  4. Link the compiled code from #2 and #3 and create the executable

If everything is working, you should see the following: Hello from Thread 0 in block 0



In [6]:
# Execute this cell to compile & run the Hello Parallelism example labeled task1
!nvcc -arch=sm_30 -o task1_out task1/task1.cu -run


Hello from Thread 0 in block 1
Hello from Thread 1 in block 1
Hello from Thread 2 in block 1
Hello from Thread 3 in block 1
Hello from Thread 4 in block 1
Hello from Thread 0 in block 2
Hello from Thread 1 in block 2
Hello from Thread 2 in block 2
Hello from Thread 3 in block 2
Hello from Thread 4 in block 2
Hello from Thread 0 in block 0
Hello from Thread 1 in block 0
Hello from Thread 2 in block 0
Hello from Thread 3 in block 0
Hello from Thread 4 in block 0

Task #1

Your first task in this lab is to play around with the number of blocks and threads per block by modifying the values in the <<< >>> brackets. To make these changes, click on the task1 folder in the editor below. After that you should be able to select task1.cu to open the file.

  • What do you expect to see if you do <<<2, 1>>>?
  • What do you expect to see if you do <<<1, 18>>>? Do the threads print out their values in order?
  • Can you make the values in the <<< >>> arbitrarily large?

After making a change, make sure to save the file by simply clicking the save button below. As a reminder, saving the file actually saves it on the cloud-hosted Amazon GPU instance you're working on. To get a copy of the files we'll be working on, consult the Post-Lab section near the end of this page.

To compile and run, simply re-execute the cell above this area.


Congratulations! You have successfully modified, compiled and executed your first program on the GPU!

Besides just getting our hands dirty compiling and executing code on the GPU, task1 was meant to reinforce a fundamental principle. If you set the number of threads per block to at least 18, you should have noticed the threads were no longer printing their Hello Parallelism strings in ascending order; some randomness started to creep in. The reason is that we're executing these blocks and threads on a massively parallel GPU, so there will be hundreds, if not thousands, of threads all executing simultaneously with respect to each other. It's basically a race to see which thread gets to print its values first.

In a more realistic example, like the ones we'll work on below, we need to ensure there is enough memory allocated to hold the results from all our threads, so we do not have such a race condition.

Writing and Launching GPU Kernels

Task #2

Now that you have had some experience launching a function on the GPU with different numbers of threads, it's time to write your first GPU kernel yourself. You're going to be accelerating the ever popular SAXPY (Single-precision A times X Plus Y) function on the GPU using CUDA C/C++.

Using the concepts introduced in Task #1, modify the following code to run on the GPU. The #FIXME: ...# text in the code will help you focus on the appropriate areas that need modification. You'll probably notice two new API calls in the code below: cudaMallocManaged and cudaFree. These two functions work with managed memory using CUDA's Unified Memory system. We'll explore this in the last task of this lab. For the moment, you just need to know that they replace malloc and free, respectively.

In the text editor below, open the task2.cu file and begin working. If you get stuck, or just want to check your answer, feel free to look at the task2_solution.cu file.



In [8]:
# Execute this cell to compile & run task2.cu
!nvcc -arch=sm_30 -o task2_out task2/task2.cu -run


c[0] = 5, c[1] = 5, c[2] = 5, c[3] = 5, c[4] = 5, 
c[4194299] = 5, c[4194300] = 5, c[4194301] = 5, c[4194302] = 5, c[4194303] = 5, 

The output of your program should be all 5's. If you got this, you have successfully modified the saxpy function to execute on the massively parallel GPU! If you are still not able to get the correct output, please have a look at the task2_solution.cu file and see if you can figure out what you were missing!
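
For reference, a GPU SAXPY kernel generally has the following shape. This is only a sketch with illustrative names and types; the code in task2/ and task2_solution.cu may differ in its details:

__global__ void saxpy( int n, int a, int *x, int *y, int *c )
{
  // Global index of this thread across all blocks.
  int i = blockIdx.x * blockDim.x + threadIdx.x;

  // Guard against threads that fall past the end of the arrays.
  if ( i < n )
    c[i] = a * x[i] + y[i];
}

// Example launch: enough blocks of 256 threads to cover all n elements.
// saxpy<<< (n + 255) / 256, 256 >>>( n, a, x, y, c );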

Task #3

The next function we will accelerate is a basic matrix multiplication function. In this simplified example, we'll assume our matrices are all square: they have the same number of rows and columns. Your goal is to modify the matrixMulGPU function with CUDA so it will run on the GPU. However, there is a new twist! Instead of using just one dimension of threads and blocks, we'll be using two dimensions: x and y. So, in addition to using blockIdx.x, blockDim.x, and threadIdx.x, you'll also need to use blockIdx.y, blockDim.y, and threadIdx.y.

Besides replacing the #FIXME: ...# blocks in the matrixMulGPU function, you will need to finish initializing the number_of_blocks variable in the main function to launch the appropriate number of thread blocks. This is all marked with #FIXME: ...# in the code.

Please make use of the hints provided if you get stuck, and you can always check the task3_solution.cu file to see the answer.

Note: do not modify the CPU version matrixMulCPU. This is used to verify the results of the GPU version.

Hint #1

You'll be using a single thread to calculate one element of the output matrix `c`. Each thread will execute the inner-most loop of the matrix multiplication formula.

Hint #2

The `row` value should be set to `blockIdx.x * blockDim.x + threadIdx.x` to get the thread which is calculating the row part of the `c` matrix. Can you figure out what the `col` value should be set to?

Hint #3

Don't forget the `__global__` keyword to tell the compiler that `matrixMulGPU` is now a GPU function!

Hint #4

Just like the CPU version, we need to make sure we do not do more work than we have data. Make sure you're doing a check in your modified code so that both `row` and `col` are less than N.
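
Putting the hints together, the body of matrixMulGPU might take roughly the following shape. This is only a sketch: the element type, the row-major indexing, and the assumption that N is a constant defined in the file are illustrative, and task3_solution.cu remains the reference answer.

__global__ void matrixMulGPU( int *a, int *b, int *c )
{
  // Each thread computes a single element of the output matrix c.
  int row = blockIdx.x * blockDim.x + threadIdx.x;
  int col = blockIdx.y * blockDim.y + threadIdx.y;

  // Do not run past the edge of the matrices.
  if ( row < N && col < N )
  {
    int val = 0;
    for ( int k = 0; k < N; ++k )
      val += a[row * N + k] * b[k * N + col];
    c[row * N + col] = val;
  }
}

In main, the 2D launch configuration could look something like this (a 16x16 block size is just one workable choice):

dim3 threads_per_block( 16, 16, 1 );
dim3 number_of_blocks( (N / threads_per_block.x) + 1, (N / threads_per_block.y) + 1, 1 );
matrixMulGPU<<< number_of_blocks, threads_per_block >>>( a, b, c );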



In [11]:
# Execute this cell to compile & run task3.cu
!nvcc -arch=sm_30 -o task3_out task3/task3.cu -run


Success!

Error Handling

If you started changing the number of blocks and threads per block in the above examples, you may have noticed some cases where you were not getting the expected answer. Up to this point we have not added any type of error checking, which makes it very hard to tell why a problem is occurring. Error checking is just as important when programming for a GPU as for a CPU. So let's add some error checking to the SAXPY example and see if we can introduce some errors to catch.

Note: In order to focus on a specific topic, and to keep code and instructions short, a number of examples in this lab leave out error checking code. However, it is highly encouraged that you include error checking in your code whenever possible!

Task #4

The SAXPY code for this task is slightly different than in Task #2. For starters, we are storing the result of our computation back into the y array instead of a separate array. We're also using float values for our arrays instead of integers. On the host side, after calculating our SAXPY result, the code will check for correctness. If everything is working, the program will print Max Error: 0.00000.

Now let's discuss the different types of errors we need to check for. Kernels are launched asynchronously with respect to the host, meaning that control returns to the host code immediately after a kernel is launched but before the kernel completes. As a result we need to check for two types of error: synchronous errors dealing with the kernel launch, and asynchronous errors that can happen during kernel execution. To check for synchronous errors we use the function cudaGetLastError(), which returns an error code of type cudaError_t. In the example below we compare the result of cudaGetLastError() against cudaSuccess (defined in the CUDA runtime headers); if they are not equal, we have an error:

cudaError_t ierrSync;
...
myKernel<<<grid, blocks>>>(...);
ierrSync = cudaGetLastError();
if (ierrSync != cudaSuccess) { printf("Sync error: %s\n", cudaGetErrorString(ierrSync)); }

To check for an error during execution we need to wait for the kernel to complete. This is accomplished using the function cudaDeviceSynchronize(). Like all CUDA API calls, cudaDeviceSynchronize() returns an error code, and used in this fashion it will capture any error that occurs on the device while the kernel executes. Example usage:

cudaError_t ierrAsync;
...
myKernel<<<grid, blocks>>>(...);
ierrAsync = cudaDeviceSynchronize();
if (ierrAsync != cudaSuccess) { printf("Async error: %s\n", cudaGetErrorString(ierrAsync)); }

In addition to the above code changes, there is a logic bug introduced in the code below. Your objective for Task #4 is to add both types of error checking to the source code and try to find the bug. Before modifying any code, compile and run the program as it is and see the result. Then add error checking and see if you can figure out what the bug is and fix it. As always, the solution is provided so you can check your work, or look at it if you get stuck.
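
For reference, wiring both checks around a kernel launch looks roughly like this sketch (the kernel name and launch parameters are placeholders rather than the actual task4.cu code):

saxpy<<< number_of_blocks, threads_per_block >>>( a, x, y );

// Catch launch-configuration problems such as "invalid configuration argument".
cudaError_t ierrSync = cudaGetLastError();
if (ierrSync != cudaSuccess) { printf("Sync error: %s\n", cudaGetErrorString(ierrSync)); }

// Wait for the kernel to finish and catch any error raised while it was executing.
cudaError_t ierrAsync = cudaDeviceSynchronize();
if (ierrAsync != cudaSuccess) { printf("Async error: %s\n", cudaGetErrorString(ierrAsync)); }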

Hint #1

An error string of "invalid configuration argument" usually indicates there are too many threads per block being used.

Hint #2

The max number of threads in the X-dimension for our GPU is 1024, but you could choose a value less than that as well.



In [20]:
# Execute this cell to compile & run task4.cu
!nvcc -arch=sm_30 -o task4_out task4/task4.cu -run


Max Error: 0.00000

It's very important to consistently add error checking into your code. We routinely get questions regarding problems with someone's code, only to find there is no error checking. Almost always, once error checking is added, the problem becomes obvious and is easily fixed.

Querying the GPU Device

The CUDA C/C++ device management API allows a programmer to query the number of devices available on a system and the features and capabilities of each device. The simple code below illustrates use of the device management API. After the number of CUDA-enabled devices attached to the system is determined via cudaGetDeviceCount(), a loop over these devices is performed (note that devices are enumerated from 0) and the function cudaGetDeviceProperties() is used to return information about a device in a variable of type cudaDeviceProp.

void deviceQuery()
{
  cudaDeviceProp prop;
  int nDevices = 0, i;
  cudaError_t ierr;

  ierr = cudaGetDeviceCount(&nDevices);

  for( i = 0; i < nDevices; ++i )
  {
     ierr = cudaGetDeviceProperties(&prop, i);
     printf("Device number: %d\n", i);
     printf("  Device name: %s\n", prop.name);
     printf("  Compute capability: %d.%d\n", prop.major, prop.minor);
     printf("  Max threads per block: %d\n", prop.maxThreadsPerBlock);
     printf("  Max threads in X-dimension of block: %d\n", prop.maxThreadsDim[0]);
     printf("  Max threads in Y-dimension of block: %d\n", prop.maxThreadsDim[1]);
     printf("  Max threads in Z-dimension of block: %d\n\n", prop.maxThreadsDim[2]);
  }
}

The type cudaDeviceProp contains many fields, only a few of which are demonstrated in this code. One value demonstrated is the compute capability, which indicates the GPU architecture and is given in major.minor format. The major component reflects the generation of the architecture, and the minor component the revision within that generation. All Fermi generation GPUs have a compute capability of 2.x, Kepler GPUs 3.x, and Maxwell GPUs 5.x. In addition to the compute capability, this code prints the maximum number of threads per block possible as well as the maximum number of threads in each dimension of the block.

Task #5

For our next task, let's modify a 2D version of the SAXPY code to check for valid launch configuration arguments before launching the kernel. The launch configuration parameters to check are the number of threads per block and the number of blocks, in both the X and Y dimensions.
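
One possible shape for such a check, using the cudaDeviceProp fields shown in the previous section, is sketched below. The variable names and block/grid sizes are illustrative and may not match task5.cu exactly:

cudaDeviceProp prop;
cudaGetDeviceProperties( &prop, 0 );   // properties of device 0

dim3 threads_per_block( 32, 16, 1 );   // candidate launch configuration
dim3 number_of_blocks( 128, 128, 1 );  // illustrative values

// Refuse to launch if the requested configuration exceeds what this GPU supports.
if ( threads_per_block.x > prop.maxThreadsDim[0] ||
     threads_per_block.y > prop.maxThreadsDim[1] ||
     threads_per_block.x * threads_per_block.y * threads_per_block.z > prop.maxThreadsPerBlock ||
     number_of_blocks.x > prop.maxGridSize[0] ||
     number_of_blocks.y > prop.maxGridSize[1] )
{
  printf( "Launch configuration exceeds the limits of this device.\n" );
  return 1;
}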



In [23]:
# Execute this cell to compile & run task5.cu
!nvcc -arch=sm_30 -o task5_out task5/task5.cu -run


Max Error: 0.00000

Try entering different sizes in the block dimension line, dim3 threads_per_block(32,16,1);, and make sure your new GPU device property check works correctly!

As you begin writing GPU code that could possibly run on multiple or different types of GPUs, you should use the ability to easily query each device to determine the optimal configuration for your code.

Memory Management

It is important to realize that the GPU has its own physical memory, just as the CPU uses system RAM for its memory. When executing code on the GPU, we have to ensure any data it needs is first copied across the PCI-Express bus to the GPU's memory before we launch our kernels.

Before the release of CUDA 6, it was a requirement that you, the developer, handle allocation and movement of data between the two memory spaces. This was, and still can be, done with the following minimal set of API calls (detailed documentation here at docs.nvidia.com):

  • cudaMalloc ( void** devPtr, size_t size ) - this API call is used to allocate memory on the GPU, and is very similar to using malloc on the CPU. You provide the address of a pointer that will point to the memory after the call completes successfully, as well as the number of bytes to allocate.
  • cudaMemcpy ( void* dst, const void* src, size_t count, cudaMemcpyKind kind ) - also very similar to the standard memcpy, this API call is used to copy data between the CPU and GPU. It takes a destination pointer, a source pointer, the number of bytes to copy, and a fourth parameter indicating which direction the data is traveling: GPU->CPU, CPU->GPU, or GPU->GPU.
  • cudaFree ( void* devPtr ) - we use this API call to free any memory we allocated on the GPU.
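
Put together, the explicit-copy pattern described by these three calls looks roughly like the following sketch (the kernel, sizes, and names are placeholders):

int n = 1 << 20;
size_t bytes = n * sizeof(float);

float *h_x = (float *)malloc( bytes );   // CPU (host) copy of the data
float *d_x = NULL;                       // GPU (device) copy of the data

cudaMalloc( (void **)&d_x, bytes );                      // allocate GPU memory
cudaMemcpy( d_x, h_x, bytes, cudaMemcpyHostToDevice );   // copy the input to the GPU

// myKernel<<< number_of_blocks, threads_per_block >>>( d_x, n );  // work on the GPU copy

cudaMemcpy( h_x, d_x, bytes, cudaMemcpyDeviceToHost );   // copy the results back

cudaFree( d_x );                                         // release the GPU memory
free( h_x );                                             // release the CPU memory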

Starting in CUDA 6, this manual handling of data movement is no longer required thanks to Unified Memory. Unified Memory creates an area of managed memory, and the underlying system handles moving this managed data between the CPU and GPU memory spaces when required. Because of this, getting an application executing on an NVIDIA GPU with CUDA C/C++ has become a much quicker and more efficient process, and the resulting code is simpler to maintain.

To make use of this managed memory area with Unified Memory, you use the following API calls.

  • cudaMallocManaged ( T** devPtr, size_t size ) - allocate size bytes in managed memory and store the resulting pointer in devPtr.
  • cudaFree ( void* devPtr ) - we use this API call to free any memory we allocated in managed memory.

Once you have used cudaMallocManaged to allocate some data, you just use the pointer in your code, regardless of whether it's the CPU or GPU accessing the data. Before Unified Memory, you typically had two pointers associated with data: one for CPU memory and one for GPU memory (the latter usually given the same name preceded with d_ to indicate device memory).
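
With Unified Memory, the explicit-copy pattern above collapses to a single pointer and no explicit copies. A rough sketch (the kernel is again a placeholder):

int n = 1 << 20;
float *x = NULL;

cudaMallocManaged( &x, n * sizeof(float) );   // one pointer, usable from both the CPU and the GPU

// ... initialize x on the CPU ...

// myKernel<<< number_of_blocks, threads_per_block >>>( x, n );   // the same pointer is valid on the GPU

cudaDeviceSynchronize();   // wait for the kernel, after which x can safely be read on the CPU again

cudaFree( x );             // managed memory is also freed with cudaFree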

Managed memory is synchronized between memory spaces at kernel launch and any device synchronization points. Please visit the CUDA documentation page on Unified Memory to read about it in more detail.

Task #6

For this task, you need to modify the code in task6.cu so it makes use of Unified Memory instead of the manual data management it currently uses. If you get stuck, there are a number of hints provided - just click on the green box to see what they are. After modifying the code, you should get this exact output if you successfully moved to using managed memory.

On device: name=hello, value=10
On host: name=dello, value=11

Hint #1

There should be no calls to cudaMemcpy or cudaMalloc in your final solution.

Hint #2

You do not need to modify the Kernel GPU function.

Hint #3

Don't forget that data is synchronized between memory spaces at kernel launch and device synchronization points. In addition, launching a kernel from the CPU is asynchronous from the CPU's perspective. It will continue executing the next line even if the GPU has not completed the kernel. You will need to add a cudaDeviceSynchronize() call after the kernel launch for the managed memory data synchronization to happen.

Hint #4

Any data that will be shared between the CPU and GPU should be allocated with cudaMallocManaged, not the host-only malloc API.



In [ ]:
# Execute this cell to compile & run task6.cu
!nvcc -arch=sm_30 -o task6_out task6/task6.cu -run

You can see why Unified Memory is so appealing: it removes the requirement for complex data management code, allowing you to get your functions executing on the GPU with less developer effort.

Case Study

Task #7

The last task in this lab is a case study to practice everything you have learned so far. Your goal will be to convert the step_kernel_mod function to execute on the GPU, as well as modify the main function to handle data movement. It's recommended you first attempt this task with Unified Memory, and then if you are interested, try using cudaMalloc and cudaMemcpy.

In task7.cu, there is a step_kernel_ref function that executes on the CPU which will be used for error checking. This function, and the associated calling code in main, should not be modified. The error checking is done at the end of main and reports the Max Error between the two results. Because this code involves floating point calculations, it's important to point out that, per the IEEE 754 specification, different processors, or even simply reordering operations on the same processor, can produce slightly different results. This is why the error checking code uses an error threshold instead of looking for an exact match. You can read about this concept in detail in the Floating Point and IEEE 754 Compliance for NVIDIA GPUs paper written by Nathan Whitehead and Alex Fit-Florea.

Credit for the original Heat Conduction CPU source code in this task is given to the article An OpenACC Example Code for a C-based heat conduction code from the University of Houston.

NOTE: You may want to proceed to the Post-Lab section and download this IPython Notebook as well as the source code now, just in case you run out of time before you are done.



In [ ]:
# Execute this cell to compile & run task7.cu
!nvcc -arch=sm_30 -o task7_out task7/task7.cu -run

Learn More

If you are interested in learning more, you can use the following resources:

Post-Lab

Finally, don't forget to save your work from this lab before time runs out and the instance shuts down!!

  1. Save this IPython Notebook by going to File -> Download as -> IPython (.ipynb) at the top of this window
  2. You can execute the following cell block to create a zip-file of the files you've been working on, and download it with the link below.

In [ ]:
%%bash
rm -f cuda_c_files.zip
zip -r cuda_c_files.zip task*/*

After executing the above cell, you should be able to download the zip file here

Lab FAQ

Q: I'm encountering issues executing the cells, or other technical problems. What should I do?
A: Please see this infrastructure FAQ.

Q: I'm getting unexpected behavior (i.e., incorrect output) when running any of the tasks.
A: It's possible that one or more of the CUDA Runtime API calls are actually returning an error, but the code above is not checking for this (due to the time constraints of this lab). Try checking the return value of any CUDA Runtime API call and see if the value is non-zero indicating an error.